
    A Review of Subsequence Time Series Clustering

    Clustering of subsequence time series remains an open issue in time series clustering. Subsequence time series clustering is used in different fields, such as e-commerce, outlier detection, speech recognition, biological systems, DNA recognition, and text mining. Pattern recognition is one useful application in the domain of subsequence time series clustering; to support it, a sequence of time series data is used. This paper reviews definitions and background related to subsequence time series clustering. The reviewed literature is categorized into three periods: pre-proof, inter-proof, and post-proof. Various state-of-the-art approaches to subsequence time series clustering are then discussed under each of these categories, and the strengths and weaknesses of the employed methods are evaluated as potential issues for future studies.
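    Subsequence clustering, as reviewed above, operates on windows sliced from a longer series rather than on whole series. A minimal sketch of that sliding-window extraction step (the window length `w` and the toy series are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def sliding_subsequences(series, w, step=1):
    """Extract all length-w subsequences from a 1-D series.

    Subsequence time-series clustering starts from this sliding-window
    matrix; each row is one candidate subsequence to be clustered.
    """
    n = len(series)
    return np.array([series[i:i + w] for i in range(0, n - w + 1, step)])

# toy example: 10-point series, window of length 4
series = np.arange(10, dtype=float)
subs = sliding_subsequences(series, w=4)
print(subs.shape)  # (7, 4): n - w + 1 overlapping windows
```

    Note that with `step=1` the windows overlap heavily, which is exactly the property that makes naive subsequence clustering problematic and motivates the periodization discussed in the review.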

    A Novel Two-Stage Spectrum-Based Approach for Dimensionality Reduction: A Case Study on the Recognition of Handwritten Numerals

    Dimensionality reduction (feature selection) is an important step in pattern recognition systems. Although there are several conventional approaches for feature selection, such as Principal Component Analysis, Random Projection, and Linear Discriminant Analysis, selecting optimal, effective, and robust features is usually a difficult task. In this paper, a new two-stage approach for dimensionality reduction is proposed. The method is based on one-dimensional and two-dimensional spectrum diagrams of the standard deviation and minimum-to-maximum distributions of the initial feature vector elements. The proposed algorithm is validated in an OCR application using two large standard benchmark handwritten OCR datasets, MNIST and Hoda. Initially, a 133-element feature vector was assembled from the most widely used features proposed in the literature. The size of this initial feature vector was then reduced from 100% to 59.40% (79 elements) for the MNIST dataset and to 43.61% (58 elements) for the Hoda dataset, respectively. Meanwhile, the accuracy of the OCR system was enhanced by 2.95% for the MNIST dataset and by 4.71% for the Hoda dataset. The achieved results show an improvement in the precision of the system in comparison to the rival approaches, Principal Component Analysis and Random Projection. The proposed technique can also be useful for generating decision rules in a pattern recognition system using rule-based classifiers.
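    The paper's spectrum diagrams are not reproduced here, but the flavour of a first-stage filter driven by a standard-deviation spectrum can be sketched as follows (the synthetic data, the half-of-mean threshold, and the resulting feature count are illustrative assumptions, not the paper's actual procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for a 133-element feature matrix (samples x features)
X = rng.normal(size=(200, 133))
X[:, :40] *= 0.01          # make the first 40 features nearly constant

# first stage: rank features by their standard-deviation "spectrum";
# near-constant features carry little discriminative information
stds = X.std(axis=0)
keep = stds > 0.5 * stds.mean()   # illustrative threshold
X_reduced = X[:, keep]
print(X_reduced.shape)  # the 40 near-constant features are dropped
```

    A second stage would then apply a further criterion (in the paper, the minimum-to-maximum distribution diagrams) to the surviving features.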

    Potential of support-vector regression for forecasting stream flow

    Stream flow is an important input for hydrology studies because it determines the water variability and magnitude of a river. Water resources engineering always deals with historical data and tries to estimate forecasting records in order to give a better prediction for any water resources application, such as designing the water potential of hydroelectric dams, estimating low flow, and maintaining the water supply. This paper presents three soft-computing approaches for dealing with these issues, i.e. artificial neural networks (ANNs), adaptive neuro-fuzzy inference systems (ANFISs), and support vector machines (SVMs). The Telom River, located in the Cameron Highlands district of Pahang, Malaysia, was used in making the estimation. The Telom River’s daily mean discharge records, along with rainfall and river-level data, were used for the period of March 1984 – January 2013 for training, testing, and validating the selected models. The SVM approach provided better results than ANFIS and ANNs in estimating the daily mean fluctuation of the stream’s flow.
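    The abstract does not specify how the discharge and rainfall records were arranged as model inputs. A common setup for such one-day-ahead forecasting, sketched here purely as an assumption, feeds each regressor (ANN, ANFIS, or SVM) the previous few days of both variables:

```python
import numpy as np

def make_lagged_dataset(discharge, rainfall, lags=3):
    """Build (X, y) for one-day-ahead streamflow forecasting.

    Each input row holds the previous `lags` days of discharge and
    rainfall; the target is the next day's discharge.  This is the
    usual tabular form fed to ANN / ANFIS / SVM regressors.
    """
    X, y = [], []
    for t in range(lags, len(discharge)):
        X.append(np.concatenate([discharge[t - lags:t],
                                 rainfall[t - lags:t]]))
        y.append(discharge[t])
    return np.array(X), np.array(y)

discharge = np.linspace(1.0, 2.0, 10)   # toy daily mean discharge
rainfall = np.zeros(10)                 # toy rainfall record
X, y = make_lagged_dataset(discharge, rainfall)
print(X.shape, y.shape)  # (7, 6) (7,)
```

    The lag count and the choice of input variables are hyperparameters; the study's actual configuration is not given in the abstract.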

    A New Dataset Size Reduction Approach for PCA-Based Classification in OCR Application

    A major problem in pattern recognition systems is the large volume of training datasets, which contain duplicate and similar training samples. To overcome this problem, several dataset size reduction and dimensionality reduction techniques have been introduced. The algorithms presently used for dataset size reduction usually remove samples near the centers of classes or support-vector samples between different classes. However, the samples near a class center carry valuable information about the class characteristics, and the support vectors are important for evaluating system efficiency. This paper reports on the use of the Modified Frequency Diagram technique for dataset size reduction. In this newly proposed technique, a training dataset is rearranged and then sieved. The sieved training dataset, together with automatic feature extraction/selection using Principal Component Analysis, is used in an OCR application. The experimental results obtained when using the proposed system on one of the largest handwritten Farsi/Arabic numeral standard OCR datasets, Hoda, show about 97% accuracy in the recognition rate. The recognition speed increased by 2.28 times, while the accuracy decreased by only 0.7%, when a sieved version of the dataset, only half the size of the initial training dataset, was used.
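    The Modified Frequency Diagram sieve itself is not described in enough detail in the abstract to reproduce, but a much-simplified stand-in conveys the idea of shrinking a training set by discarding redundant samples (the toy data and the exact-duplicate criterion are assumptions; the actual technique also sieves similar, not just identical, samples):

```python
import numpy as np

def sieve_duplicates(X, y):
    """Drop exact-duplicate (sample, label) pairs from a training set.

    A much-simplified stand-in for a frequency-based sieve: duplicate
    samples contribute no new information, so keeping one copy shrinks
    the dataset without moving class centers or support vectors.
    """
    seen = set()
    keep = []
    for i, (row, label) in enumerate(zip(X, y)):
        key = (row.tobytes(), int(label))
        if key not in seen:
            seen.add(key)
            keep.append(i)
    return X[keep], y[keep]

X = np.array([[0, 1], [0, 1], [1, 1], [0, 1]])
y = np.array([0, 0, 1, 0])
Xs, ys = sieve_duplicates(X, y)
print(len(Xs))  # 2 unique (sample, label) pairs remain
```

    The sieved set would then be passed to PCA for feature extraction, as in the paper's pipeline.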

    Clustering of large time-series datasets using a multi-step approach / Saeed Reza Aghabozorgi Sahaf Yazdi

    Various data mining approaches are currently used to analyse data within different domains. Among them, clustering is one of the most widely used, typically adopted to group data based on their similarities. Data in systems such as finance, healthcare, and business are stored as time series, and clustering such complex data can discover patterns that carry valuable information. Time-series clustering is useful not only as an exploratory technique but also as a subroutine in more complex data mining algorithms. As a result, time-series clustering (as a part of temporal data mining research) has attracted increasing interest in areas such as medicine, biology, finance, economics, and the Web, and several studies focusing on time-series clustering have been conducted in these areas. Many of these studies address the time complexity of time-series clustering in large datasets and employ dimensionality reduction approaches together with conventional clustering algorithms. However, conventional clustering approaches are not practical for time-series data because they are essentially designed for static data; applying them to time series leads to poor clustering accuracy. Adequate clustering approaches for time series are therefore lacking. In this thesis, the low quality of existing works is taken into account, and a new multi-step clustering model is proposed. This model facilitates the accurate clustering of time-series datasets and is designed specifically for very large ones, overcoming the limitations of conventional clustering algorithms in dealing with time-series data. In the first step of the model, the data are pre-processed, represented by symbolic aggregate approximation, and grouped approximately by a novel approach. Then, the groups are refined in a second step by an accurate clustering method, and a representative is defined for each cluster. Finally, the representatives are merged to construct the ultimate clusters. The model is then extended into an interactive model in which the results delivered to the user increase in accuracy over time. In this work, accurate clustering based on shape similarity is performed. It is shown that clustering time series does not require calculating the exact distances/similarities between all time series in a dataset; instead, accurate clusters can be obtained by using prototypes of similar time series. To evaluate its accuracy, the proposed model is tested extensively using published time-series datasets from diverse domains. The model is more accurate than existing works and is also scalable to large datasets owing to its use of multi-resolution representations of time series at different levels of clustering. Moreover, it provides a clear understanding of the domains through its ability to generate hierarchical and arbitrary-shape clusters of time-series data.
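    The symbolic aggregate approximation (SAX) representation used in the model's first step can be sketched as follows (the 4-letter alphabet, the breakpoints, and the toy series are illustrative; the thesis may use different parameters):

```python
import numpy as np

def sax(series, segments, alphabet="abcd"):
    """Symbolic Aggregate approXimation of a 1-D series.

    Step 1: z-normalise, then average over equal-width segments (PAA);
    the series length must be divisible by `segments`.
    Step 2: map each segment mean to a symbol via Gaussian breakpoints.
    The breakpoints below split N(0,1) into four equiprobable regions,
    matching a 4-letter alphabet.
    """
    series = (series - series.mean()) / series.std()
    paa = series.reshape(segments, -1).mean(axis=1)
    breakpoints = np.array([-0.67, 0.0, 0.67])
    return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in paa)

series = np.array([0., 0., 1., 1., 2., 2., 3., 3.])
print(sax(series, segments=4))  # monotone rise maps to "abcd"
```

    Because the symbolic strings are short and discrete, an approximate first-pass grouping over them is cheap, which is what makes the multi-step design scale to very large datasets.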

    Clustering of large time series datasets


    A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data.

    Similarity or distance measures are core components used by distance-based clustering algorithms to cluster similar data points into the same clusters, while dissimilar or distant data points are placed into different clusters. The performance of similarity measures has mostly been addressed in two- or three-dimensional spaces; beyond that, to the best of our knowledge, no empirical study has revealed the behavior of similarity measures when dealing with high-dimensional datasets. To fill this gap, a technical framework is proposed in this study to analyze, compare, and benchmark the influence of different similarity measures on the results of distance-based clustering algorithms. For reproducibility, fifteen publicly available datasets were used, so future distance measures can be evaluated against the results of the measures discussed in this work. These datasets were classified into low- and high-dimensional categories in order to study the performance of each measure against each category. This research should help the research community identify suitable distance measures for their datasets and facilitate the comparison and evaluation of newly proposed similarity or distance measures against traditional ones.
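    A tiny example of the kind of disagreement between measures that such a benchmark exposes: Euclidean distance is scale-sensitive, while cosine distance ignores magnitude (the example vectors below are illustrative):

```python
import numpy as np

def euclidean(a, b):
    """Straight-line distance; grows with any difference in magnitude."""
    return np.linalg.norm(a - b)

def cosine_dist(a, b):
    """1 - cosine similarity; depends only on direction, not magnitude."""
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# two points differing only in overall scale: far apart for Euclidean,
# effectively identical for cosine
a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a
print(euclidean(a, b))    # large
print(cosine_dist(a, b))  # ~0: same direction
```

    A clustering algorithm paired with each measure can therefore produce quite different partitions of the same data, which is precisely what the proposed framework quantifies across the fifteen datasets.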